The Performance of Work Stealing in Multiprogrammed Environments
Authors
Abstract
As small-scale, shared-memory multiprocessors make their way onto desktops, the high-performance parallel applications that run on these machines will have to live alongside other applications, such as editors and web browsers. Unfortunately, unless parallel applications are coscheduled [4] or subject to process control [2], they display poor performance in such multiprogrammed environments [2]. As an alternative to coscheduling or process control, we investigate the use of dynamic, user-level thread scheduling. In particular, we show that a non-blocking [3] implementation of the work-stealing thread-scheduling algorithm [1] achieves efficient performance even when the number of available processors grows and shrinks over time. All of the experiments in this paper were run on a Sun Ultra Enterprise 5000 with eight 167-MHz UltraSPARC processors running Solaris 2.5.1, with no modifications. The work-stealing thread scheduler studied in this paper has been implemented in a C++ threads library called Hood, and our applications are coded in C++ using Hood. Hood is implemented on top of the Solaris threads library. Traditionally, multithreaded applications use static partitioning and display poor performance when run in a multiprogrammed environment. Such a program creates some number P of (lightweight) processes (also known as kernel threads), and each process performs a 1/P fraction of the total work. Let T1 denote the work of the computation, which we define as the execution time with P = 1 process running on 1 dedicated processor. With P > 1 processes running on P dedicated processors, if the overhead of creating and synchronizing these processes is small compared to the T1/P work per process, then the execution time will be TP = T1/P, thereby giving a speedup of T1/TP = P. In a multiprogrammed environment, however, we might find that the actual number PA of processors on which the program runs is smaller than the number P of processes.
In this case, we can still hope to achieve an execution time of TP = T1/PA, thereby giving a speedup of T1/TP = PA. Unfortunately, as Figure 1(a) shows, this hope is not realized. This figure shows the measured speedup of several statically partitioned applications for different numbers P of processes.
Similar Papers
Dynamic Memory ABP Work-Stealing
The non-blocking work-stealing algorithm of Arora, Blumofe, and Plaxton (henceforth ABP work-stealing) is on its way to becoming the multiprocessor load-balancing technology of choice in both industry and academia. This highly efficient scheme is based on a collection of array-based deques with low-cost synchronization among local and stealing processes. Unfortunately, the algorithm’s synchron...
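The array-based deque this snippet refers to can be sketched roughly as follows. This is a simplified, bounded-capacity illustration of the idea, not the ABP paper's code: the owner pushes and pops at the bottom without synchronization in the common case, while thieves compete for the top with a compare-and-swap. It uses sequentially consistent atomics for clarity and omits the overflow and memory-reclamation issues that the actual ABP algorithm and its dynamic-memory variants address.

```cpp
#include <atomic>

// Simplified ABP-style work-stealing deque (bounded, illustrative only).
struct Deque {
    static const int CAP = 1024;
    long tasks[CAP];
    std::atomic<int> top{0};     // thieves advance this with CAS
    std::atomic<int> bottom{0};  // only the owner touches this

    // Owner: add a task at the bottom. No CAS needed in the common case.
    void push_bottom(long t) {
        int b = bottom.load();
        tasks[b % CAP] = t;
        bottom.store(b + 1);
    }

    // Owner: remove the most recently pushed task.
    bool pop_bottom(long& out) {
        int b = bottom.load() - 1;
        bottom.store(b);
        int t = top.load();
        if (b < t) {             // deque was empty
            bottom.store(t);
            return false;
        }
        out = tasks[b % CAP];
        if (b > t) return true;  // more than one task: no conflict possible
        // Exactly one task left: a thief may be racing us, so decide with CAS.
        bool won = top.compare_exchange_strong(t, t + 1);
        bottom.store(t + 1);
        return won;
    }

    // Thief: try to take the oldest task from the top.
    bool steal_top(long& out) {
        int t = top.load();
        int b = bottom.load();
        if (t >= b) return false;            // appears empty
        out = tasks[t % CAP];
        return top.compare_exchange_strong(t, t + 1);  // lost race => retry elsewhere
    }
};
```

The key property is that neither owner nor thief ever blocks: a failed CAS simply means another process made progress, which is what makes the scheduler robust when processors are preempted.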
Hood: A User-Level Threads Library for Multiprogrammed Multiprocessors
The Hood user-level threads library delivers efficient performance under multiprogramming without any need for kernel-level resource management, such as coscheduling or process control. It does so by scheduling threads with a non-blocking implementation of the work-stealing algorithm. With this implementation, the execution time of a program running with arbitrarily many processes on arbitraril...
Dynamic Processor Allocation for Adaptively Parallel Work-Stealing Jobs
This thesis addresses the problem of scheduling multiple, concurrent, adaptively parallel jobs on a multiprogrammed shared-memory multiprocessor. Adaptively parallel jobs are jobs for which the number of processors that can be used without waste varies during execution. We focus on the specific case of parallel jobs that are scheduled using a randomized work-stealing algorithm, as is used in th...
History-Based Adaptive Work Distribution
Exploiting parallelism of increasingly heterogeneous parallel architectures is challenging due to the complexity of parallelism management. To achieve high performance portability whilst preserving high productivity, high-level approaches to parallel programming delegate parallelism management, such as partitioning and work distribution, to the compiler and the run-time system. Random work stea...
Efficient Work Stealing for Portability of Nested Parallelism and Composability of Multithreaded Programs
We present performance evaluations of a parallel-for loop implemented with a work-stealing technique. The work-stealing parallel-for transforms the parallel loop into a binary tree using divide-and-conquer. Iterations are distributed among the leaf procedures of the binary tree, and parallel execution proceeds by stealing subtrees from the bottom of the tree. The work s...